In the last tutorial, we acted as data engineers: we collected the data and transformed it into a more convenient form for the data analyst and the data scientist. Now the processed data is delivered to a data analyst.

The job of a data analyst is to transform data into information: any insight that helps achieve the higher goal, which in our case is STLF (Short-Term Load Forecasting). The useful information extracted will help the data scientist create robust models, so a data analyst must also communicate the results as effectively as possible.

As you might have already observed, this tutorial is neither a Word document nor a PDF. It is an HTML file, the same file type as the webpages you view in your internet browser, and it should be opened in a browser as well. I wrote all the tutorials in this format because it gives us access to the best data visualization tools available. RStudio calls such reports ‘R Notebooks.’ R Notebooks can be converted to PDFs, Word documents, and even PowerPoint slides. It is a beautiful way of creating data analysis reports. You can also add videos to your report. Below is a video tutorial to get started with R Notebooks.


You can also download the code to create this report from the button on the top right labeled as ‘Code.’


In the last tutorial, you created a script to read and process data to be ready for further analysis. With this tutorial, you will find “Houses.csv” which contains the total usage of the nine houses specified in the metadata file. We will analyze this data in this tutorial.

Houses = read.csv("Houses.csv")

Houses$Date_Time = as.POSIXct(Houses$Date_Time)

dim(Houses)
## [1] 8760   10

Houses.csv has 8760 rows and ten columns. The row count matches one reading per hour for a full year (365 × 24 = 8760).

str(Houses)
## 'data.frame':    8760 obs. of  10 variables:
##  $ Date_Time: POSIXct, format: "2018-06-01 00:00:00" "2018-06-01 01:00:00" ...
##  $ House14  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ House15  : num  4.9363 0.0216 0.0216 0.0216 0.9376 ...
##  $ House18  : num  1.404 1.383 0.815 0.974 1.318 ...
##  $ House2   : num  0.805 0.814 0.776 0.613 0.595 ...
##  $ House21  : num  0.816 1.399 2.59 2.402 2.364 ...
##  $ House26  : num  3.57 3.15 3.16 3.26 2.88 ...
##  $ House39  : num  2.23 2.14 2.17 2.1 1.95 ...
##  $ House4   : num  3.72 3.78 3.43 3.32 3.15 ...
##  $ House9   : num  0.765 0.399 0.34 0.39 1.039 ...

The first column is Date_Time, which contains hourly date/time labels. The remaining nine columns hold the hourly electricity consumption of the nine households. The unit of these columns is kW.
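Since everything that follows relies on the readings being hourly, it can be worth confirming that no hour is missing or repeated. The snippet below is a minimal sketch of such a check. The helper `is_regular_hourly()` and the `demo_time` vector are hypothetical names introduced here for illustration, and the demo uses UTC timestamps rather than the real file.

```r
# Hypothetical helper (not part of the original tutorial): checks that a
# POSIXct vector advances in exact one-hour steps with no gaps or duplicates.
is_regular_hourly <- function(date_time) {
  steps <- as.numeric(diff(date_time), units = "hours")
  length(steps) > 0 && all(steps == 1)
}

# Stand-in for Houses$Date_Time: one year of hourly stamps in UTC.
demo_time <- seq(as.POSIXct("2018-06-01 00:00:00", tz = "UTC"),
                 as.POSIXct("2019-05-31 23:00:00", tz = "UTC"),
                 by = "hour")

length(demo_time)             # number of hourly stamps in the year
is_regular_hourly(demo_time)  # TRUE when no hour is missing or repeated
```

On the real data the call would be `is_regular_hourly(Houses$Date_Time)`. Note that `as.POSIXct()` without a `tz` argument uses the time zone of your R session, so in a region with daylight saving time the check may report apparent gaps that are not in the file.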

summary(Houses)
##    Date_Time                      House14           House15        
##  Min.   :2018-06-01 00:00:00   Min.   :0.00000   Min.   :0.000797  
##  1st Qu.:2018-08-31 05:45:00   1st Qu.:0.07953   1st Qu.:0.722849  
##  Median :2018-11-30 11:30:00   Median :0.15190   Median :1.123162  
##  Mean   :2018-11-30 11:30:00   Mean   :0.28536   Mean   :1.546613  
##  3rd Qu.:2019-03-01 17:15:00   3rd Qu.:0.32116   3rd Qu.:2.024532  
##  Max.   :2019-05-31 23:00:00   Max.   :3.71810   Max.   :9.153413  
##     House18            House2          House21          House26      
##  Min.   :0.00002   Min.   :0.0000   Min.   :0.0001   Min.   :0.0005  
##  1st Qu.:0.28800   1st Qu.:0.1657   1st Qu.:0.3578   1st Qu.:0.5133  
##  Median :0.43756   Median :0.3245   Median :0.5542   Median :0.7940  
##  Mean   :0.63221   Mean   :0.4111   Mean   :0.7428   Mean   :1.0076  
##  3rd Qu.:0.84921   3rd Qu.:0.5568   3rd Qu.:0.8100   3rd Qu.:1.1825  
##  Max.   :2.53655   Max.   :3.4504   Max.   :4.5927   Max.   :4.5039  
##     House39           House4            House9        
##  Min.   :0.0001   Min.   :0.00032   Min.   :0.000032  
##  1st Qu.:0.3840   1st Qu.:0.98107   1st Qu.:0.145839  
##  Median :0.6132   Median :1.39622   Median :0.259798  
##  Mean   :0.6704   Mean   :1.65321   Mean   :0.482208  
##  3rd Qu.:0.8969   3rd Qu.:2.10885   3rd Qu.:0.563267  
##  Max.   :2.9461   Max.   :7.00872   Max.   :5.874098

Explain the summary above.

Another way to display the same information is a boxplot. See the interactive boxplots below (Hover your cursor over the plot).

library(plotly)
## Warning: package 'plotly' was built under R version 4.0.5
## Warning: package 'ggplot2' was built under R version 4.0.4
library(reshape)
## Warning: package 'reshape' was built under R version 4.0.5
houses_melt = melt(Houses[,-1])

ggplotly(
ggplot(houses_melt)+
  geom_boxplot(aes(x = variable, y = value))+
  labs(x="Houses", y = "Usage [kW]")
)

Explain the code above.
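As a hint for the reshaping step, the sketch below shows what turning a wide table into a long one looks like on a toy example. It uses base R's `stack()` as a stand-in for `melt()` so that it runs without extra packages. The data frame `wide` and its two houses are made up for illustration.

```r
# Toy wide table with two made-up houses (values are illustrative only).
wide <- data.frame(HouseA = c(1.0, 2.0, 3.0),
                   HouseB = c(0.5, 0.4, 0.3))

# stack() is base R; it plays the role melt() plays in the tutorial code.
long <- stack(wide)
names(long) <- c("value", "variable")  # mirror melt()'s column names

long  # one row per (house, reading) pair instead of one column per house
```

The long layout is what lets a single `geom_boxplot()` call draw one box per house: the house name becomes a value in a column that can be mapped to the x-axis.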

List all the useful information that you can extract from these boxplots.
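To cross-check what you read off a boxplot, you can compute the same quantities by hand. The sketch below assumes the common convention of drawing the box at the quartiles and flagging points beyond 1.5 times the interquartile range as outliers. The helper `box_summary()` and the values in `demo_usage` are hypothetical.

```r
# Hypothetical helper: quartiles and the upper 1.5 * IQR fence for one series.
box_summary <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[3] - q[1]
  upper_fence <- q[3] + 1.5 * iqr
  c(q1 = q[1], median = q[2], q3 = q[3],
    upper_fence = upper_fence,
    n_above = sum(x > upper_fence, na.rm = TRUE))  # points drawn as outliers
}

# Made-up usage values in kW; the last one is an obvious spike.
demo_usage <- c(0.2, 0.3, 0.3, 0.4, 0.5, 0.5, 0.6, 4.0)
box_summary(demo_usage)
```

Applied to every house, for example with `sapply(Houses[, -1], box_summary)`, this gives a compact table to compare against the interactive plot.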

  ggplotly(
  ggplot(Houses, aes(x = Date_Time))+
  geom_line(aes(y = House2, color = "House 2"))+
  geom_line(aes(y = House21, color = "House 21"))+
  geom_line(aes(y = House26, color = "House 26"))+
  geom_line(aes(y = House39, color = "House 39"))+
  geom_line(aes(y = House4, color = "House 4"))+
  geom_line(aes(y = House9, color = "House 9"))+
  geom_line(aes(y = House14, color = "House 14"))+
  geom_line(aes(y = House15, color = "House 15"))+
  geom_line(aes(y = House18, color = "House 18"))+ 
  theme(legend.title = element_blank()) + labs(x = "Date\\Time", y = "Usage [kW]")
  )

The plot above shows all the data. The y-axis shows the electricity load of each household, and the x-axis shows the date/time. Click on the legend labels on the right to add or remove households from the plot. Double-click on any house in the legend to isolate its plot.

Observe that for every house, electricity consumption is high in the summer months and low in the winter months. Electricity consumption in Pakistan is highly dependent on the weather.
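The seasonal pattern can also be checked numerically by averaging the usage within each calendar month. The following is a minimal sketch. `monthly_mean()` is a hypothetical helper and the demo values are made up, not taken from Houses.csv.

```r
# Hypothetical helper: mean usage per calendar month for one house.
monthly_mean <- function(date_time, usage) {
  lt <- as.POSIXlt(date_time)
  # "YYYY-MM" keys sort chronologically and do not depend on the locale.
  key <- sprintf("%04d-%02d", lt$year + 1900, lt$mon + 1)
  tapply(usage, key, mean, na.rm = TRUE)
}

# Made-up example: two readings in July at 2 kW, two in January at 0.5 kW.
demo_time <- as.POSIXct(c("2018-07-01 12:00:00", "2018-07-02 12:00:00",
                          "2019-01-01 12:00:00", "2019-01-02 12:00:00"),
                        tz = "UTC")
demo_usage <- c(2, 2, 0.5, 0.5)
monthly_mean(demo_time, demo_usage)
```

For a real house the call would look like `monthly_mean(Houses$Date_Time, Houses$House4)`, which makes it easy to compare the summer and winter averages side by side.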

Isolate House 2. What is strange about this house’s electricity consumption? What is a possible explanation?

Isolate House 14. What is strange about this house’s electricity consumption? What is a possible explanation?

One of the most interesting houses is House 15; it has the highest variation in electricity consumption. Let’s isolate House 15 for further analysis.

House_15 = Houses[,c(1,3)]

House_15$Month = paste(months(House_15$Date_Time, abbreviate = TRUE) ,
                       as.POSIXlt(House_15$Date_Time)$year+1900)

ggplotly(
ggplot(House_15)+
  geom_boxplot(aes(x = Month, y = House15))+
  labs(x="Month", y = "Usage [kW]")+
  theme(axis.text.x = element_text(angle = 45))+
  scale_x_discrete(limits= c(paste(month.abb[6:12], "2018"), 
                             paste(month.abb[1:5], "2019")))
)

The above box plot shows the energy consumption pattern of House 15 for each month.
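One detail in the month labels is worth knowing: `months()` returns names in the language of your R session, while `month.abb` is always English, so the two only match in an English locale. The sketch below shows one possible alternative that builds the labels from `month.abb` and stores them as a factor in calendar order. `month_label()` is a hypothetical helper, not the tutorial's method.

```r
# Hypothetical helper: "Jun 2018"-style labels as a factor whose levels are
# in calendar order, built from month.abb so they are always English.
month_label <- function(date_time) {
  lt <- as.POSIXlt(date_time)
  label <- paste(month.abb[lt$mon + 1], lt$year + 1900)
  # Order the distinct labels by year and month, not alphabetically.
  ord <- order(lt$year, lt$mon)
  factor(label, levels = unique(label[ord]))
}

# Made-up timestamps, deliberately out of order.
demo_time <- as.POSIXct(c("2019-01-15 00:00:00", "2018-06-15 00:00:00",
                          "2018-12-15 00:00:00"), tz = "UTC")
levels(month_label(demo_time))
```

Because ggplot2 keeps the level order of a factor on a discrete axis, a column built this way should not need the explicit `scale_x_discrete(limits = ...)` call.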

Create a similar boxplot showing the hourly pattern for House 4, with one box per hour of the day. I have attached a snapshot of the plot for your guidance.

Hourly Boxplot

There is a clear pattern followed by House 4 every day, as shown in the hourly boxplot. The electricity consumption is high at night and low during the day. Most of the residents of the household might have a routine of leaving the house at 7 or 8 am and returning home at 8 or 9 pm. This information can be used for STLF.
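The daily routine can be summarized in numbers as well as in a plot. The sketch below groups readings by hour of the day and takes the median of each group. `hourly_median()` is a hypothetical helper, and the two-day series is made up to mimic a house that is empty during the day.

```r
# Hypothetical helper: median usage for each hour of the day (0 to 23).
hourly_median <- function(date_time, usage) {
  hour <- as.POSIXlt(date_time)$hour
  tapply(usage, hour, median, na.rm = TRUE)
}

# Made-up two-day series: 0.5 kW from 08:00 to 19:59, 2 kW otherwise.
demo_time <- seq(as.POSIXct("2018-06-01 00:00:00", tz = "UTC"),
                 by = "hour", length.out = 48)
demo_hour <- as.POSIXlt(demo_time)$hour
demo_usage <- ifelse(demo_hour >= 8 & demo_hour < 20, 0.5, 2)

hourly_median(demo_time, demo_usage)
```

The same hour column, added to a data frame, is what the hourly boxplot would use on its x-axis.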

Is there any other information that you can deduce from the plots above?

I have a hypothesis that is worth exploring further, because it could be useful for STLF:

The electricity consumption of every hour depends on the electricity consumption of the previous hour.

I deduced this statement from the Hourly Boxplot. As House 4 follows a daily pattern, the electricity consumption of an hour can be approximated from the electricity consumption of the same hour of the previous day.
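One simple way to start probing this idea is to correlate the series with a shifted copy of itself, once with a shift of one hour and once with a shift of 24 hours. The following is only a sketch under stated assumptions: `lag_cor()` is a hypothetical helper, and the demo series is a synthetic 24-hour cycle, not real household data.

```r
# Hypothetical helper: correlation between a series and itself shifted
# back by `lag` hours (lag = 1: previous hour, lag = 24: previous day).
lag_cor <- function(usage, lag) {
  n <- length(usage)
  stopifnot(lag >= 1, lag < n)
  cor(usage[(lag + 1):n], usage[1:(n - lag)], use = "complete.obs")
}

# Made-up series: a clean 24-hour cycle repeated for two weeks.
hours <- 0:(24 * 14 - 1)
demo_usage <- 1 + sin(2 * pi * hours / 24)

lag_cor(demo_usage, 1)   # neighbouring hours
lag_cor(demo_usage, 24)  # same hour on the previous day
```

On real data, a call such as `lag_cor(Houses$House4, 24)` would indicate how strongly the previous day's value at the same hour tracks the current one. Base R's `acf()` gives the same kind of information for many lags at once.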

Create an R Markdown report that does a similar analysis for all households.

Explain the patterns that you observe in these households’ electricity consumption.

Email me your detailed report.